BoomBikes Company. A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis, for a price or for free. Many bike-share systems let people borrow a bike from a computer-controlled "dock": the user enters payment information and the system unlocks a bike, which can then be returned to any other dock belonging to the same system.
The company wants to know:
1. Which variables are significant in predicting the demand for shared bikes.
2. How well those variables describe the bike demand.
You are required to model the demand for shared bikes with the available independent variables. Management will use the model to understand exactly how demand varies with different features, adjust the business strategy accordingly to meet demand levels and customer expectations, and better understand the demand dynamics of a new market.
# Import required libraries
import numpy as np
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
#load/read data
Bike = pd.read_csv("day.csv", low_memory=False)
Bike.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 730 entries, 0 to 729
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   instant     730 non-null    int64
 1   dteday      730 non-null    object
 2   season      730 non-null    int64
 3   yr          730 non-null    int64
 4   mnth        730 non-null    int64
 5   holiday     730 non-null    int64
 6   weekday     730 non-null    int64
 7   workingday  730 non-null    int64
 8   weathersit  730 non-null    int64
 9   temp        730 non-null    float64
 10  atemp       730 non-null    float64
 11  hum         730 non-null    float64
 12  windspeed   730 non-null    float64
 13  casual      730 non-null    int64
 14  registered  730 non-null    int64
 15  cnt         730 non-null    int64
dtypes: float64(4), int64(11), object(1)
memory usage: 91.4+ KB
All columns are numeric except `dteday` (stored as object), and there is no missing data. These are the fields available in day.csv.
#check the size
Bike.shape
(730, 16)
There are 730 rows and 16 columns
#check the head of dataset
Bike.head()
| instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 01-01-2018 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 14.110847 | 18.18125 | 80.5833 | 10.749882 | 331 | 654 | 985 |
| 1 | 2 | 02-01-2018 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 14.902598 | 17.68695 | 69.6087 | 16.652113 | 131 | 670 | 801 |
| 2 | 3 | 03-01-2018 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 8.050924 | 9.47025 | 43.7273 | 16.636703 | 120 | 1229 | 1349 |
| 3 | 4 | 04-01-2018 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 8.200000 | 10.60610 | 59.0435 | 10.739832 | 108 | 1454 | 1562 |
| 4 | 5 | 05-01-2018 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 9.305237 | 11.46350 | 43.6957 | 12.522300 | 82 | 1518 | 1600 |
#check the tail of dataset
Bike.tail()
| instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 725 | 726 | 27-12-2019 | 1 | 1 | 12 | 0 | 4 | 1 | 2 | 10.420847 | 11.33210 | 65.2917 | 23.458911 | 247 | 1867 | 2114 |
| 726 | 727 | 28-12-2019 | 1 | 1 | 12 | 0 | 5 | 1 | 2 | 10.386653 | 12.75230 | 59.0000 | 10.416557 | 644 | 2451 | 3095 |
| 727 | 728 | 29-12-2019 | 1 | 1 | 12 | 0 | 6 | 0 | 2 | 10.386653 | 12.12000 | 75.2917 | 8.333661 | 159 | 1182 | 1341 |
| 728 | 729 | 30-12-2019 | 1 | 1 | 12 | 0 | 0 | 0 | 1 | 10.489153 | 11.58500 | 48.3333 | 23.500518 | 364 | 1432 | 1796 |
| 729 | 730 | 31-12-2019 | 1 | 1 | 12 | 0 | 1 | 1 | 2 | 8.849153 | 11.17435 | 57.7500 | 10.374682 | 439 | 2290 | 2729 |
The `instant` column is just a row index and is not needed.
Bike.describe()
| instant | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 730.000000 | 730.000000 | 730.000000 | 730.000000 | 730.000000 | 730.000000 | 730.000000 | 730.000000 | 730.000000 | 730.000000 | 730.000000 | 730.000000 | 730.000000 | 730.000000 | 730.000000 |
| mean | 365.500000 | 2.498630 | 0.500000 | 6.526027 | 0.028767 | 2.997260 | 0.683562 | 1.394521 | 20.319259 | 23.726322 | 62.765175 | 12.763620 | 849.249315 | 3658.757534 | 4508.006849 |
| std | 210.877136 | 1.110184 | 0.500343 | 3.450215 | 0.167266 | 2.006161 | 0.465405 | 0.544807 | 7.506729 | 8.150308 | 14.237589 | 5.195841 | 686.479875 | 1559.758728 | 1936.011647 |
| min | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2.424346 | 3.953480 | 0.000000 | 1.500244 | 2.000000 | 20.000000 | 22.000000 |
| 25% | 183.250000 | 2.000000 | 0.000000 | 4.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 13.811885 | 16.889713 | 52.000000 | 9.041650 | 316.250000 | 2502.250000 | 3169.750000 |
| 50% | 365.500000 | 3.000000 | 0.500000 | 7.000000 | 0.000000 | 3.000000 | 1.000000 | 1.000000 | 20.465826 | 24.368225 | 62.625000 | 12.125325 | 717.000000 | 3664.500000 | 4548.500000 |
| 75% | 547.750000 | 3.000000 | 1.000000 | 10.000000 | 0.000000 | 5.000000 | 1.000000 | 2.000000 | 26.880615 | 30.445775 | 72.989575 | 15.625589 | 1096.500000 | 4783.250000 | 5966.000000 |
| max | 730.000000 | 4.000000 | 1.000000 | 12.000000 | 1.000000 | 6.000000 | 1.000000 | 3.000000 | 35.328347 | 42.044800 | 97.250000 | 34.000021 | 3410.000000 | 6946.000000 | 8714.000000 |
1. There is no missing data.
2. All values are numeric except `dteday` (the date).
3. There are 730 entries and 16 columns.
4. The variables span very different ranges, so scaling will be needed before modelling.
# lets check the correlation to see which variables are highly correlated
plt.figure(figsize = (15,10))
sns.heatmap(Bike.corr(), annot = True, cmap="YlGnBu")
plt.show()
High correlation is observed between `cnt` and both `temp` and `atemp` (ignoring `instant`, `casual`, and `registered`).
sns.pairplot(Bike)
plt.show()
#Drop column list
datalist = ['instant','dteday', 'casual', 'registered']
#Drop columns
Bike = Bike.drop(datalist, axis = 1)
plt.figure(figsize = (20,12))
plt.subplot(2,2,1)
sns.boxplot(x='weathersit', y = 'cnt', data = Bike)
plt.subplot(2,2,2)
sns.boxplot(x='yr', y = 'cnt', data = Bike)
plt.subplot(2,2,3)
sns.boxplot(x='season', y = 'cnt', data = Bike)
plt.subplot(2,2,4)
sns.boxplot(x='mnth', y = 'cnt', data = Bike)
plt.show()
# As there are two years, let's plot and understand in detail
# Multivariate analysis - cnt, year, month
plt.figure(figsize=(10,5))
sns.boxplot(x='mnth', y = 'cnt', hue = 'yr', data = Bike)
plt.show()
plt.figure(figsize = (20,6))
plt.subplot(1,3,1)
sns.boxplot(x='holiday', y = 'cnt', data = Bike)
plt.subplot(1,3,2)
sns.boxplot(x='weekday', y = 'cnt', data = Bike)
plt.subplot(1,3,3)
sns.boxplot(x='workingday', y = 'cnt', data = Bike)
plt.show()
datalist = ['temp', 'atemp', 'hum', 'windspeed', 'cnt']
sns.pairplot(Bike[datalist])
plt.show()
# Check correlation for the same variables
plt.figure(figsize = (15,10))
sns.heatmap(Bike[datalist].corr(), annot = True, cmap="YlGnBu")
plt.show()
Categorical variables:
Note: no dummy variables are needed for `holiday` and `workingday`, as they are already binary yes/no flags holding 0/1 values.
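For the multi-level categoricals, `pd.get_dummies(..., drop_first=True)` keeps k-1 columns for a k-level variable, with the dropped level acting as the baseline (an all-zero row). A minimal sketch on toy data (not the assignment's):

```python
import pandas as pd

# Toy 4-level "season"-style variable; drop_first=True drops level 1,
# so a row of all zeros represents the baseline level (spring here).
s = pd.Series([1, 2, 3, 4, 1], name="season")
dummies = pd.get_dummies(s, drop_first=True)
dummies.columns = ["summer", "fall", "winter"]  # rename levels 2, 3, 4

print(dummies.shape)          # (5, 3): k-1 = 3 columns
print(dummies.iloc[0].sum())  # 0: first row is the baseline level
```

Dropping the first level avoids the "dummy variable trap", where the full set of k dummies is perfectly collinear with the intercept.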
seasons = pd.get_dummies(Bike.season, drop_first = True)
seasons
| 2 | 3 | 4 | |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 |
| ... | ... | ... | ... |
| 725 | 0 | 0 | 0 |
| 726 | 0 | 0 | 0 |
| 727 | 0 | 0 | 0 |
| 728 | 0 | 0 | 0 |
| 729 | 0 | 0 | 0 |
730 rows × 3 columns
#renaming column names for seasons table
seasons.rename(columns={2:'summer', 3:'fall', 4:'winter'}, inplace = True)
seasons
| summer | fall | winter | |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 |
| ... | ... | ... | ... |
| 725 | 0 | 0 | 0 |
| 726 | 0 | 0 | 0 |
| 727 | 0 | 0 | 0 |
| 728 | 0 | 0 | 0 |
| 729 | 0 | 0 | 0 |
730 rows × 3 columns
# Dummy variables for weekday (level 0, Sunday, is dropped); rename the columns
weekday = pd.get_dummies(Bike.weekday,drop_first = True)
weekday.rename(columns={1:'mon', 2:'tue', 3:'wed', 4:'thu', 5:'fri', 6:'sat'}, inplace = True)
weekday
| mon | tue | wed | thu | fri | sat | |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 725 | 0 | 0 | 0 | 1 | 0 | 0 |
| 726 | 0 | 0 | 0 | 0 | 1 | 0 |
| 727 | 0 | 0 | 0 | 0 | 0 | 1 |
| 728 | 0 | 0 | 0 | 0 | 0 | 0 |
| 729 | 1 | 0 | 0 | 0 | 0 | 0 |
730 rows × 6 columns
# Dummy variables for weathersit (level 1, clear weather, is dropped); rename the remaining columns
weather = pd.get_dummies(Bike.weathersit, drop_first = True)
weather.rename(columns={2:'w_mist', 3:'w_light'}, inplace = True)
weather
| w_mist | w_light | |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 1 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 0 | 0 |
| ... | ... | ... |
| 725 | 1 | 0 |
| 726 | 1 | 0 |
| 727 | 1 | 0 |
| 728 | 0 | 0 |
| 729 | 1 | 0 |
730 rows × 2 columns
#Dummy variables for year
year = pd.get_dummies(Bike.yr, drop_first = True)
year.rename(columns={1:'yr_2019'}, inplace = True)
year
| yr_2019 | |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 725 | 1 |
| 726 | 1 |
| 727 | 1 |
| 728 | 1 |
| 729 | 1 |
730 rows × 1 columns
# Dummy variables for month (January is dropped); add a "month_" prefix to the columns
month = pd.get_dummies(Bike.mnth,drop_first= True)
month = month.add_prefix("month_")
month
| month_2 | month_3 | month_4 | month_5 | month_6 | month_7 | month_8 | month_9 | month_10 | month_11 | month_12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 725 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 726 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 727 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 728 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 729 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
730 rows × 11 columns
# Creating list of categorical columns
catlist = ['season', 'yr', 'mnth', 'weekday', 'weathersit']
# Drop the categorical columns from the Bike table.
Bike_New = Bike.drop(catlist, axis = 1)
# Creating table for all dummy variables (categorical variables)
cat_concat = pd.concat([seasons, year, month, weekday, weather], axis = 1)
# Concat the dummy variable table with the Bike_New (main)
Bike_New = pd.concat([Bike_New,cat_concat], axis = 1)
Bike_New.shape
(730, 30)
There are 30 columns after converting the categorical variables into dummy variables.
Bike_New.head()
| holiday | workingday | temp | atemp | hum | windspeed | cnt | summer | fall | winter | ... | month_11 | month_12 | mon | tue | wed | thu | fri | sat | w_mist | w_light | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 14.110847 | 18.18125 | 80.5833 | 10.749882 | 985 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 14.902598 | 17.68695 | 69.6087 | 16.652113 | 801 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 1 | 8.050924 | 9.47025 | 43.7273 | 16.636703 | 1349 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 8.200000 | 10.60610 | 59.0435 | 10.739832 | 1562 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 9.305237 | 11.46350 | 43.6957 | 12.522300 | 1600 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
5 rows × 30 columns
#import sklearn for splitting
from sklearn.model_selection import train_test_split
np.random.seed(0)
# Data split into 70:30 ratio of train and test
df_train, df_test = train_test_split(Bike_New,train_size = 0.7, test_size = 0.3, random_state =100 )
As we saw, the features span very different ranges. Scaling does not change the model's fit or significance, only the coefficients, and it makes the coefficients of the linear model easier to interpret.
We are using MinMax scaling.
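MinMax scaling maps each feature to [0, 1] via x' = (x − min) / (max − min). A tiny illustration on toy values (not the assignment data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One toy column: 2 is the min -> 0, 10 is the max -> 1, 6 is halfway -> 0.5
x = np.array([[2.0], [10.0], [6.0]])
scaled = MinMaxScaler().fit_transform(x)
print(scaled.ravel())  # [0.  1.  0.5]
```

Note that the scaler is fitted on the training split only; the same fitted scaler should then only `transform` the test split, so no test-set statistics leak into training.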
# Import
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# Apply scaler() to all the columns except dummy and 0, 1 variables
num_vars = ['temp', 'atemp', 'hum', 'windspeed', 'cnt']
df_train[num_vars] = scaler.fit_transform(df_train[num_vars])
df_train.head()
| holiday | workingday | temp | atemp | hum | windspeed | cnt | summer | fall | winter | ... | month_11 | month_12 | mon | tue | wed | thu | fri | sat | w_mist | w_light | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 653 | 0 | 1 | 0.509887 | 0.501133 | 0.575354 | 0.300794 | 0.864243 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 576 | 0 | 1 | 0.815169 | 0.766351 | 0.725633 | 0.264686 | 0.827658 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 426 | 0 | 0 | 0.442393 | 0.438975 | 0.640189 | 0.255342 | 0.465255 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 728 | 0 | 0 | 0.245101 | 0.200348 | 0.498067 | 0.663106 | 0.204096 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 482 | 0 | 0 | 0.395666 | 0.391735 | 0.504508 | 0.188475 | 0.482973 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
5 rows × 30 columns
df_train.describe()
| holiday | workingday | temp | atemp | hum | windspeed | cnt | summer | fall | winter | ... | month_11 | month_12 | mon | tue | wed | thu | fri | sat | w_mist | w_light | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 510.000000 | 510.000000 | 510.000000 | 510.000000 | 510.000000 | 510.000000 | 510.000000 | 510.000000 | 510.000000 | 510.00000 | ... | 510.000000 | 510.000000 | 510.000000 | 510.000000 | 510.000000 | 510.000000 | 510.000000 | 510.000000 | 510.000000 | 510.000000 |
| mean | 0.025490 | 0.676471 | 0.537262 | 0.512989 | 0.650369 | 0.320768 | 0.513620 | 0.245098 | 0.262745 | 0.24902 | ... | 0.086275 | 0.084314 | 0.150980 | 0.131373 | 0.158824 | 0.133333 | 0.127451 | 0.154902 | 0.343137 | 0.029412 |
| std | 0.157763 | 0.468282 | 0.225844 | 0.212385 | 0.145882 | 0.169797 | 0.224593 | 0.430568 | 0.440557 | 0.43287 | ... | 0.281045 | 0.278131 | 0.358381 | 0.338139 | 0.365870 | 0.340268 | 0.333805 | 0.362166 | 0.475223 | 0.169124 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.339853 | 0.332086 | 0.538643 | 0.199179 | 0.356420 | 0.000000 | 0.000000 | 0.00000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 1.000000 | 0.540519 | 0.526811 | 0.653714 | 0.296763 | 0.518638 | 0.000000 | 0.000000 | 0.00000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.000000 | 1.000000 | 0.735215 | 0.688457 | 0.754830 | 0.414447 | 0.684710 | 0.000000 | 1.000000 | 0.00000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 30 columns
# Lets check the correlation
plt.figure(figsize = (30,30))
sns.heatmap(df_train.corr(), annot = True, cmap="YlGnBu")
plt.show()
# Scatter plot of temp vs cnt to check the relationship
plt.figure(figsize = [6,6])
plt.scatter(df_train.cnt, df_train.temp)
plt.show()
y_train = df_train.pop('cnt')
X_train = df_train
X_train.head()
| holiday | workingday | temp | atemp | hum | windspeed | summer | fall | winter | yr_2019 | ... | month_11 | month_12 | mon | tue | wed | thu | fri | sat | w_mist | w_light | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 653 | 0 | 1 | 0.509887 | 0.501133 | 0.575354 | 0.300794 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 576 | 0 | 1 | 0.815169 | 0.766351 | 0.725633 | 0.264686 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 426 | 0 | 0 | 0.442393 | 0.438975 | 0.640189 | 0.255342 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 728 | 0 | 0 | 0.245101 | 0.200348 | 0.498067 | 0.663106 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 482 | 0 | 0 | 0.395666 | 0.391735 | 0.504508 | 0.188475 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
5 rows × 29 columns
Fit a regression line through the training data using statsmodels, adding the intercept term with `sm.add_constant(X)`.
import statsmodels.api as sm
# add constant
X_train_lm = sm.add_constant(X_train)
#Create first fitting model
lr = sm.OLS(y_train,X_train_lm).fit()
lr.params
const         0.175618
holiday      -0.042394
workingday    0.043879
temp          0.401322
atemp         0.050628
hum          -0.151812
windspeed    -0.184388
summer        0.086796
fall          0.048580
winter        0.153968
yr_2019       0.232208
month_2       0.030389
month_3       0.063853
month_4       0.062565
month_5       0.087257
month_6       0.060862
month_7       0.023289
month_8       0.078641
month_9       0.144371
month_10      0.070260
month_11      0.020783
month_12      0.016994
mon          -0.009946
tue          -0.007753
wed           0.005823
thu           0.001806
fri           0.011556
sat           0.054533
w_mist       -0.061030
w_light      -0.256697
dtype: float64
# check the summary
print(lr.summary())
OLS Regression Results
==============================================================================
Dep. Variable: cnt R-squared: 0.853
Model: OLS Adj. R-squared: 0.845
Method: Least Squares F-statistic: 99.96
Date: Sun, 27 Aug 2023 Prob (F-statistic): 8.42e-181
Time: 16:05:17 Log-Likelihood: 528.03
No. Observations: 510 AIC: -998.1
Df Residuals: 481 BIC: -875.3
Df Model: 28
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.1756 0.030 5.777 0.000 0.116 0.235
holiday -0.0424 0.024 -1.793 0.074 -0.089 0.004
workingday 0.0439 0.009 4.689 0.000 0.025 0.062
temp 0.4013 0.142 2.821 0.005 0.122 0.681
atemp 0.0506 0.138 0.366 0.714 -0.221 0.322
hum -0.1518 0.039 -3.940 0.000 -0.228 -0.076
windspeed -0.1844 0.026 -7.003 0.000 -0.236 -0.133
summer 0.0868 0.024 3.679 0.000 0.040 0.133
fall 0.0486 0.030 1.618 0.106 -0.010 0.108
winter 0.1540 0.026 5.932 0.000 0.103 0.205
yr_2019 0.2322 0.008 28.792 0.000 0.216 0.248
month_2 0.0304 0.021 1.474 0.141 -0.010 0.071
month_3 0.0639 0.022 2.857 0.004 0.020 0.108
month_4 0.0626 0.034 1.864 0.063 -0.003 0.129
month_5 0.0873 0.036 2.412 0.016 0.016 0.158
month_6 0.0609 0.039 1.556 0.120 -0.016 0.138
month_7 0.0233 0.044 0.529 0.597 -0.063 0.110
month_8 0.0786 0.042 1.873 0.062 -0.004 0.161
month_9 0.1444 0.037 3.853 0.000 0.071 0.218
month_10 0.0703 0.034 2.041 0.042 0.003 0.138
month_11 0.0208 0.033 0.633 0.527 -0.044 0.085
month_12 0.0170 0.027 0.641 0.522 -0.035 0.069
mon -0.0099 0.010 -1.023 0.307 -0.029 0.009
tue -0.0078 0.011 -0.695 0.488 -0.030 0.014
wed 0.0058 0.011 0.554 0.580 -0.015 0.026
thu 0.0018 0.011 0.165 0.869 -0.020 0.023
fri 0.0116 0.011 1.031 0.303 -0.010 0.034
sat 0.0545 0.015 3.757 0.000 0.026 0.083
w_mist -0.0610 0.010 -5.845 0.000 -0.082 -0.041
w_light -0.2567 0.026 -9.712 0.000 -0.309 -0.205
==============================================================================
Omnibus: 85.143 Durbin-Watson: 2.041
Prob(Omnibus): 0.000 Jarque-Bera (JB): 237.880
Skew: -0.809 Prob(JB): 2.21e-52
Kurtosis: 5.929 Cond. No. 1.22e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.17e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
VIF (variance inflation factor) quantifies how strongly each feature is correlated with the other features, and it is an important diagnostic in linear regression.
$ VIF_i = \frac{1}{1 - {R_i}^2} $
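The $R_i^2$ in this formula comes from an auxiliary regression of feature i on all the other features. A self-contained numpy sketch of that definition on toy data (statsmodels' `variance_inflation_factor` computes the same quantity below):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)                   # unrelated to the others
X = np.column_stack([x1, x2, x3])

def vif(X, i):
    """VIF of column i: regress it on the other columns, then 1 / (1 - R^2)."""
    y = X[:, i]
    A = np.column_stack([np.ones(len(y)), np.delete(X, i, axis=1)])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return 1.0 / (1.0 - r2)

# x1 and x2 inflate each other's VIF; x3 stays near 1.
print([round(vif(X, i), 1) for i in range(3)])
```

A common rule of thumb treats VIF above 5 (or 10) as a sign of problematic multicollinearity.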
#import the VIF lib
from statsmodels.stats.outliers_influence import variance_inflation_factor
#create dataframe for all the features and their VIF's
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF' , ascending = False)
vif
| Features | VIF | |
|---|---|---|
| 0 | holiday | inf |
| 25 | fri | inf |
| 24 | thu | inf |
| 23 | wed | inf |
| 22 | tue | inf |
| 21 | mon | inf |
| 1 | workingday | inf |
| 2 | temp | 447.70 |
| 3 | atemp | 383.54 |
| 4 | hum | 20.79 |
| 7 | fall | 15.42 |
| 16 | month_8 | 11.01 |
| 8 | winter | 10.93 |
| 15 | month_7 | 9.62 |
| 6 | summer | 8.88 |
| 14 | month_6 | 7.35 |
| 17 | month_9 | 7.34 |
| 13 | month_5 | 7.16 |
| 18 | month_10 | 6.64 |
| 19 | month_11 | 5.99 |
| 12 | month_4 | 5.64 |
| 5 | windspeed | 4.71 |
| 20 | month_12 | 3.78 |
| 11 | month_3 | 3.06 |
| 27 | w_mist | 2.21 |
| 9 | yr_2019 | 2.09 |
| 26 | sat | 1.93 |
| 10 | month_2 | 1.71 |
| 28 | w_light | 1.23 |
For a few features the VIF value is infinite.
$ VIF_i = \frac{1}{1 - {R_i}^2} $, so the VIF becomes infinite when $ {R_i}^2 $ is 1.
This means the feature is perfectly explained by a linear combination of the other features (perfect multicollinearity), as with `holiday`, `workingday`, and the weekday dummies here.
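An infinite VIF is the signature of exact linear dependence among features. A toy numpy demonstration of the mechanism (hypothetical columns, not the assignment's):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = rng.normal(size=100)
c = a + b                          # exact linear combination of a and b
A = np.column_stack([np.ones(100), a, b])

# Auxiliary regression of c on the other columns fits perfectly,
# so R^2 -> 1 and VIF = 1 / (1 - R^2) diverges to infinity.
beta, *_ = np.linalg.lstsq(A, c, rcond=None)
r2 = 1 - ((c - A @ beta) ** 2).sum() / ((c - c.mean()) ** 2).sum()
print(r2)  # effectively 1.0
```

Dropping one column from any such exactly dependent group removes the infinite VIFs without losing information.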
#import RFE and linear regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
# Running RFE to select the top 20 features
lm = LinearRegression()
lm.fit(X_train,y_train)
rfe = RFE(lm, n_features_to_select = 20)
rfe = rfe.fit(X_train,y_train)
list(zip(X_train.columns, rfe.support_, rfe.ranking_))
[('holiday', True, 1),
('workingday', True, 1),
('temp', True, 1),
('atemp', True, 1),
('hum', True, 1),
('windspeed', True, 1),
('summer', True, 1),
('fall', True, 1),
('winter', True, 1),
('yr_2019', True, 1),
('month_2', False, 2),
('month_3', True, 1),
('month_4', True, 1),
('month_5', True, 1),
('month_6', True, 1),
('month_7', False, 3),
('month_8', True, 1),
('month_9', True, 1),
('month_10', True, 1),
('month_11', False, 4),
('month_12', False, 5),
('mon', False, 6),
('tue', False, 7),
('wed', False, 9),
('thu', False, 10),
('fri', False, 8),
('sat', True, 1),
('w_mist', True, 1),
('w_light', True, 1)]
X_train.columns[~rfe.support_]
Index(['month_2', 'month_7', 'month_11', 'month_12', 'mon', 'tue', 'wed',
'thu', 'fri'],
dtype='object')
# selected list
col_list = X_train.columns[rfe.support_]
col_list
Index(['holiday', 'workingday', 'temp', 'atemp', 'hum', 'windspeed', 'summer',
'fall', 'winter', 'yr_2019', 'month_3', 'month_4', 'month_5', 'month_6',
'month_8', 'month_9', 'month_10', 'sat', 'w_mist', 'w_light'],
dtype='object')
# Create dataframe for the selected columns
X_train_rfe = X_train[col_list]
X_train_rfe
| holiday | workingday | temp | atemp | hum | windspeed | summer | fall | winter | yr_2019 | month_3 | month_4 | month_5 | month_6 | month_8 | month_9 | month_10 | sat | w_mist | w_light | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 653 | 0 | 1 | 0.509887 | 0.501133 | 0.575354 | 0.300794 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 576 | 0 | 1 | 0.815169 | 0.766351 | 0.725633 | 0.264686 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 426 | 0 | 0 | 0.442393 | 0.438975 | 0.640189 | 0.255342 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 728 | 0 | 0 | 0.245101 | 0.200348 | 0.498067 | 0.663106 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 482 | 0 | 0 | 0.395666 | 0.391735 | 0.504508 | 0.188475 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 526 | 0 | 1 | 0.824514 | 0.762183 | 0.605840 | 0.355596 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 578 | 0 | 1 | 0.863973 | 0.824359 | 0.679690 | 0.187140 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 53 | 0 | 1 | 0.202618 | 0.218747 | 0.435939 | 0.111379 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 350 | 0 | 0 | 0.248216 | 0.223544 | 0.577930 | 0.431816 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 79 | 0 | 1 | 0.462664 | 0.434043 | 0.759870 | 0.529881 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
510 rows × 20 columns
# add constant
X_train_rfe = sm.add_constant(X_train_rfe)
# Create fitting model
lm = sm.OLS(y_train, X_train_rfe).fit()
#check summary
print(lm.summary())
OLS Regression Results
==============================================================================
Dep. Variable: cnt R-squared: 0.852
Model: OLS Adj. R-squared: 0.846
Method: Least Squares F-statistic: 140.5
Date: Sun, 27 Aug 2023 Prob (F-statistic): 4.51e-188
Time: 16:05:18 Log-Likelihood: 525.37
No. Observations: 510 AIC: -1009.
Df Residuals: 489 BIC: -919.8
Df Model: 20
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.1905 0.029 6.556 0.000 0.133 0.248
holiday -0.0490 0.027 -1.829 0.068 -0.102 0.004
workingday 0.0436 0.011 3.801 0.000 0.021 0.066
temp 0.4339 0.136 3.190 0.002 0.167 0.701
atemp 0.0322 0.137 0.236 0.814 -0.236 0.301
hum -0.1592 0.038 -4.220 0.000 -0.233 -0.085
windspeed -0.1826 0.026 -7.027 0.000 -0.234 -0.132
summer 0.0871 0.022 4.041 0.000 0.045 0.129
fall 0.0489 0.023 2.126 0.034 0.004 0.094
winter 0.1574 0.014 11.110 0.000 0.130 0.185
yr_2019 0.2310 0.008 28.926 0.000 0.215 0.247
month_3 0.0472 0.017 2.816 0.005 0.014 0.080
month_4 0.0440 0.026 1.717 0.087 -0.006 0.094
month_5 0.0679 0.026 2.606 0.009 0.017 0.119
month_6 0.0390 0.022 1.737 0.083 -0.005 0.083
month_8 0.0569 0.018 3.155 0.002 0.021 0.092
month_9 0.1240 0.017 7.144 0.000 0.090 0.158
month_10 0.0487 0.017 2.819 0.005 0.015 0.083
sat 0.0534 0.014 3.696 0.000 0.025 0.082
w_mist -0.0598 0.010 -5.791 0.000 -0.080 -0.040
w_light -0.2533 0.026 -9.686 0.000 -0.305 -0.202
==============================================================================
Omnibus: 82.801 Durbin-Watson: 2.030
Prob(Omnibus): 0.000 Jarque-Bera (JB): 233.627
Skew: -0.784 Prob(JB): 1.86e-51
Kurtosis: 5.922 Cond. No. 88.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
#create dataframe for all the features and their VIF's
vif = pd.DataFrame()
vif['Features'] = X_train_rfe.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i) for i in range(X_train_rfe.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF' , ascending = False)
vif
| Features | VIF | |
|---|---|---|
| 3 | temp | 61.71 |
| 0 | const | 55.37 |
| 4 | atemp | 55.03 |
| 8 | fall | 6.71 |
| 7 | summer | 5.63 |
| 13 | month_5 | 3.43 |
| 12 | month_4 | 3.11 |
| 9 | winter | 2.46 |
| 14 | month_6 | 2.28 |
| 5 | hum | 1.98 |
| 2 | workingday | 1.89 |
| 15 | month_8 | 1.85 |
| 18 | sat | 1.79 |
| 11 | month_3 | 1.63 |
| 19 | w_mist | 1.58 |
| 17 | month_10 | 1.54 |
| 16 | month_9 | 1.46 |
| 20 | w_light | 1.28 |
| 6 | windspeed | 1.27 |
| 1 | holiday | 1.17 |
| 10 | yr_2019 | 1.04 |
# Dropping atemp: very high VIF and an insignificant p-value (0.814)
X_train_new = X_train_rfe.drop(['atemp'], axis = 1)
# add constant
X_train_new = sm.add_constant(X_train_new)
# Create fitting model
lm = sm.OLS(y_train, X_train_new).fit()
#check summary
print(lm.summary())
OLS Regression Results
==============================================================================
Dep. Variable: cnt R-squared: 0.852
Model: OLS Adj. R-squared: 0.846
Method: Least Squares F-statistic: 148.2
Date: Sun, 27 Aug 2023 Prob (F-statistic): 3.75e-189
Time: 16:05:18 Log-Likelihood: 525.34
No. Observations: 510 AIC: -1011.
Df Residuals: 490 BIC: -926.0
Df Model: 19
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.1908 0.029 6.576 0.000 0.134 0.248
holiday -0.0493 0.027 -1.843 0.066 -0.102 0.003
workingday 0.0436 0.011 3.804 0.000 0.021 0.066
temp 0.4648 0.037 12.674 0.000 0.393 0.537
hum -0.1587 0.038 -4.217 0.000 -0.233 -0.085
windspeed -0.1839 0.025 -7.240 0.000 -0.234 -0.134
summer 0.0871 0.022 4.043 0.000 0.045 0.129
fall 0.0484 0.023 2.115 0.035 0.003 0.093
winter 0.1575 0.014 11.139 0.000 0.130 0.185
yr_2019 0.2310 0.008 28.956 0.000 0.215 0.247
month_3 0.0473 0.017 2.823 0.005 0.014 0.080
month_4 0.0443 0.026 1.734 0.084 -0.006 0.094
month_5 0.0678 0.026 2.605 0.009 0.017 0.119
month_6 0.0387 0.022 1.728 0.085 -0.005 0.083
month_8 0.0564 0.018 3.152 0.002 0.021 0.092
month_9 0.1240 0.017 7.150 0.000 0.090 0.158
month_10 0.0487 0.017 2.822 0.005 0.015 0.083
sat 0.0534 0.014 3.699 0.000 0.025 0.082
w_mist -0.0599 0.010 -5.804 0.000 -0.080 -0.040
w_light -0.2538 0.026 -9.744 0.000 -0.305 -0.203
==============================================================================
Omnibus: 82.440 Durbin-Watson: 2.030
Prob(Omnibus): 0.000 Jarque-Bera (JB): 232.258
Skew: -0.781 Prob(JB): 3.68e-51
Kurtosis: 5.914 Cond. No. 22.1
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
#create dataframe for all the features and their VIF's
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF' , ascending = False)
vif
| Features | VIF | |
|---|---|---|
| 0 | const | 55.28 |
| 7 | fall | 6.65 |
| 6 | summer | 5.63 |
| 3 | temp | 4.50 |
| 12 | month_5 | 3.43 |
| 11 | month_4 | 3.10 |
| 8 | winter | 2.46 |
| 13 | month_6 | 2.27 |
| 4 | hum | 1.98 |
| 2 | workingday | 1.89 |
| 14 | month_8 | 1.83 |
| 17 | sat | 1.79 |
| 10 | month_3 | 1.63 |
| 18 | w_mist | 1.58 |
| 16 | month_10 | 1.54 |
| 15 | month_9 | 1.46 |
| 19 | w_light | 1.27 |
| 5 | windspeed | 1.22 |
| 1 | holiday | 1.17 |
| 9 | yr_2019 | 1.04 |
# Dropping fall: highest VIF among the predictors
X_train_new = X_train_new.drop(['fall'], axis = 1)
# add constant
X_train_new = sm.add_constant(X_train_new)
# Create fitting model
lm = sm.OLS(y_train, X_train_new).fit()
lm.summary()
| Dep. Variable: | cnt | R-squared: | 0.850 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.845 |
| Method: | Least Squares | F-statistic: | 155.1 |
| Date: | Sun, 27 Aug 2023 | Prob (F-statistic): | 2.72e-189 |
| Time: | 16:05:18 | Log-Likelihood: | 523.03 |
| No. Observations: | 510 | AIC: | -1008. |
| Df Residuals: | 491 | BIC: | -927.6 |
| Df Model: | 18 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 0.1902 | 0.029 | 6.534 | 0.000 | 0.133 | 0.247 |
| holiday | -0.0523 | 0.027 | -1.949 | 0.052 | -0.105 | 0.000 |
| workingday | 0.0434 | 0.011 | 3.776 | 0.000 | 0.021 | 0.066 |
| temp | 0.5230 | 0.024 | 21.523 | 0.000 | 0.475 | 0.571 |
| hum | -0.1692 | 0.037 | -4.518 | 0.000 | -0.243 | -0.096 |
| windspeed | -0.1894 | 0.025 | -7.472 | 0.000 | -0.239 | -0.140 |
| summer | 0.0684 | 0.020 | 3.470 | 0.001 | 0.030 | 0.107 |
| winter | 0.1424 | 0.012 | 11.622 | 0.000 | 0.118 | 0.167 |
| yr_2019 | 0.2293 | 0.008 | 28.788 | 0.000 | 0.214 | 0.245 |
| month_3 | 0.0392 | 0.016 | 2.394 | 0.017 | 0.007 | 0.071 |
| month_4 | 0.0435 | 0.026 | 1.699 | 0.090 | -0.007 | 0.094 |
| month_5 | 0.0584 | 0.026 | 2.268 | 0.024 | 0.008 | 0.109 |
| month_6 | 0.0323 | 0.022 | 1.448 | 0.148 | -0.012 | 0.076 |
| month_8 | 0.0676 | 0.017 | 3.936 | 0.000 | 0.034 | 0.101 |
| month_9 | 0.1344 | 0.017 | 8.047 | 0.000 | 0.102 | 0.167 |
| month_10 | 0.0430 | 0.017 | 2.513 | 0.012 | 0.009 | 0.077 |
| sat | 0.0529 | 0.014 | 3.656 | 0.000 | 0.024 | 0.081 |
| w_mist | -0.0584 | 0.010 | -5.648 | 0.000 | -0.079 | -0.038 |
| w_light | -0.2480 | 0.026 | -9.540 | 0.000 | -0.299 | -0.197 |
| Omnibus: | 71.760 | Durbin-Watson: | 2.031 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 184.663 |
| Skew: | -0.710 | Prob(JB): | 7.96e-41 |
| Kurtosis: | 5.583 | Cond. No. | 21.1 |
# create a dataframe of all the features and their VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF' , ascending = False)
vif
| | Features | VIF |
|---|---|---|
| 0 | const | 55.28 |
| 6 | summer | 4.69 |
| 11 | month_5 | 3.33 |
| 10 | month_4 | 3.10 |
| 12 | month_6 | 2.23 |
| 3 | temp | 1.96 |
| 4 | hum | 1.94 |
| 2 | workingday | 1.89 |
| 7 | winter | 1.83 |
| 16 | sat | 1.79 |
| 13 | month_8 | 1.67 |
| 17 | w_mist | 1.57 |
| 9 | month_3 | 1.54 |
| 15 | month_10 | 1.51 |
| 14 | month_9 | 1.34 |
| 18 | w_light | 1.26 |
| 5 | windspeed | 1.21 |
| 1 | holiday | 1.17 |
| 8 | yr_2019 | 1.03 |
# dropping 'summer': now the highest VIF (4.69) among the predictors
X_train_new = X_train_new.drop(['summer'], axis = 1)
# add constant
X_train_new = sm.add_constant(X_train_new)
# Create fitting model
lm = sm.OLS(y_train, X_train_new).fit()
print(lm.summary())
OLS Regression Results
==============================================================================
Dep. Variable: cnt R-squared: 0.847
Model: OLS Adj. R-squared: 0.841
Method: Least Squares F-statistic: 159.9
Date: Sun, 27 Aug 2023 Prob (F-statistic): 7.83e-188
Time: 16:05:18 Log-Likelihood: 516.85
No. Observations: 510 AIC: -997.7
Df Residuals: 492 BIC: -921.5
Df Model: 17
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.1856 0.029 6.313 0.000 0.128 0.243
holiday -0.0515 0.027 -1.899 0.058 -0.105 0.002
workingday 0.0439 0.012 3.775 0.000 0.021 0.067
temp 0.5190 0.025 21.146 0.000 0.471 0.567
hum -0.1602 0.038 -4.242 0.000 -0.234 -0.086
windspeed -0.1892 0.026 -7.381 0.000 -0.240 -0.139
winter 0.1417 0.012 11.437 0.000 0.117 0.166
yr_2019 0.2293 0.008 28.475 0.000 0.213 0.245
month_3 0.0638 0.015 4.278 0.000 0.034 0.093
month_4 0.1125 0.016 6.871 0.000 0.080 0.145
month_5 0.1267 0.017 7.571 0.000 0.094 0.160
month_6 0.0771 0.018 4.198 0.000 0.041 0.113
month_8 0.0685 0.017 3.945 0.000 0.034 0.103
month_9 0.1345 0.017 7.967 0.000 0.101 0.168
month_10 0.0431 0.017 2.491 0.013 0.009 0.077
sat 0.0544 0.015 3.715 0.000 0.026 0.083
w_mist -0.0577 0.010 -5.526 0.000 -0.078 -0.037
w_light -0.2482 0.026 -9.444 0.000 -0.300 -0.197
==============================================================================
Omnibus: 65.477 Durbin-Watson: 2.002
Prob(Omnibus): 0.000 Jarque-Bera (JB): 172.023
Skew: -0.643 Prob(JB): 4.42e-38
Kurtosis: 5.538 Cond. No. 20.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# create a dataframe of all the features and their VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF' , ascending = False)
vif
| | Features | VIF |
|---|---|---|
| 0 | const | 55.16 |
| 3 | temp | 1.96 |
| 4 | hum | 1.93 |
| 2 | workingday | 1.89 |
| 6 | winter | 1.83 |
| 15 | sat | 1.79 |
| 12 | month_8 | 1.67 |
| 16 | w_mist | 1.57 |
| 14 | month_10 | 1.51 |
| 11 | month_6 | 1.48 |
| 10 | month_5 | 1.38 |
| 13 | month_9 | 1.34 |
| 17 | w_light | 1.26 |
| 8 | month_3 | 1.25 |
| 9 | month_4 | 1.24 |
| 5 | windspeed | 1.21 |
| 1 | holiday | 1.17 |
| 7 | yr_2019 | 1.03 |
# dropping 'holiday': insignificant coefficient (p-value 0.058 > 0.05)
X_train_new = X_train_new.drop(['holiday'], axis = 1)
# add constant
X_train_new = sm.add_constant(X_train_new)
# Create fitting model
lm = sm.OLS(y_train, X_train_new).fit()
print(lm.summary())
OLS Regression Results
==============================================================================
Dep. Variable: cnt R-squared: 0.846
Model: OLS Adj. R-squared: 0.841
Method: Least Squares F-statistic: 168.8
Date: Sun, 27 Aug 2023 Prob (F-statistic): 3.54e-188
Time: 16:05:18 Log-Likelihood: 514.99
No. Observations: 510 AIC: -996.0
Df Residuals: 493 BIC: -924.0
Df Model: 16
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.1765 0.029 6.067 0.000 0.119 0.234
workingday 0.0515 0.011 4.705 0.000 0.030 0.073
temp 0.5199 0.025 21.132 0.000 0.472 0.568
hum -0.1592 0.038 -4.205 0.000 -0.234 -0.085
windspeed -0.1900 0.026 -7.395 0.000 -0.240 -0.140
winter 0.1413 0.012 11.378 0.000 0.117 0.166
yr_2019 0.2295 0.008 28.424 0.000 0.214 0.245
month_3 0.0652 0.015 4.368 0.000 0.036 0.095
month_4 0.1132 0.016 6.896 0.000 0.081 0.145
month_5 0.1280 0.017 7.638 0.000 0.095 0.161
month_6 0.0786 0.018 4.275 0.000 0.042 0.115
month_8 0.0694 0.017 3.989 0.000 0.035 0.104
month_9 0.1333 0.017 7.882 0.000 0.100 0.167
month_10 0.0444 0.017 2.559 0.011 0.010 0.078
sat 0.0621 0.014 4.403 0.000 0.034 0.090
w_mist -0.0575 0.010 -5.489 0.000 -0.078 -0.037
w_light -0.2475 0.026 -9.394 0.000 -0.299 -0.196
==============================================================================
Omnibus: 70.585 Durbin-Watson: 1.981
Prob(Omnibus): 0.000 Jarque-Bera (JB): 193.116
Skew: -0.679 Prob(JB): 1.16e-42
Kurtosis: 5.692 Cond. No. 20.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# create a dataframe of all the features and their VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF' , ascending = False)
vif
| | Features | VIF |
|---|---|---|
| 0 | const | 53.68 |
| 2 | temp | 1.96 |
| 3 | hum | 1.93 |
| 5 | winter | 1.83 |
| 11 | month_8 | 1.67 |
| 1 | workingday | 1.66 |
| 14 | sat | 1.65 |
| 15 | w_mist | 1.57 |
| 13 | month_10 | 1.50 |
| 10 | month_6 | 1.48 |
| 9 | month_5 | 1.38 |
| 12 | month_9 | 1.34 |
| 16 | w_light | 1.26 |
| 7 | month_3 | 1.25 |
| 8 | month_4 | 1.23 |
| 4 | windspeed | 1.21 |
| 6 | yr_2019 | 1.03 |
All predictor VIFs are now below 2 (the constant's high VIF is expected and can be ignored), so multicollinearity is no longer a concern.
# trial: drop the 'temp' column to check its contribution to the model
X_train_new1 = X_train_new
X_train_new1 = X_train_new1.drop(['temp'], axis = 1)
# add constant
X_train_new1 = sm.add_constant(X_train_new1)
# Create fitting model
lm1 = sm.OLS(y_train, X_train_new1).fit()
print(lm1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: cnt R-squared: 0.706
Model: OLS Adj. R-squared: 0.697
Method: Least Squares F-statistic: 79.03
Date: Sun, 27 Aug 2023 Prob (F-statistic): 1.21e-120
Time: 16:05:18 Log-Likelihood: 350.54
No. Observations: 510 AIC: -669.1
Df Residuals: 494 BIC: -601.3
Df Model: 15
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.3186 0.039 8.164 0.000 0.242 0.395
workingday 0.0585 0.015 3.880 0.000 0.029 0.088
hum -0.0013 0.051 -0.026 0.979 -0.102 0.099
windspeed -0.2249 0.035 -6.361 0.000 -0.294 -0.155
winter 0.1028 0.017 6.066 0.000 0.069 0.136
yr_2019 0.2502 0.011 22.638 0.000 0.228 0.272
month_3 0.0597 0.021 2.900 0.004 0.019 0.100
month_4 0.1521 0.022 6.763 0.000 0.108 0.196
month_5 0.2294 0.022 10.354 0.000 0.186 0.273
month_6 0.2495 0.023 10.954 0.000 0.205 0.294
month_8 0.2488 0.021 11.877 0.000 0.208 0.290
month_9 0.2618 0.022 12.024 0.000 0.219 0.305
month_10 0.1170 0.023 4.992 0.000 0.071 0.163
sat 0.0635 0.019 3.268 0.001 0.025 0.102
w_mist -0.1056 0.014 -7.493 0.000 -0.133 -0.078
w_light -0.2905 0.036 -8.019 0.000 -0.362 -0.219
==============================================================================
Omnibus: 21.888 Durbin-Watson: 1.858
Prob(Omnibus): 0.000 Jarque-Bera (JB): 53.367
Skew: 0.131 Prob(JB): 2.58e-12
Kurtosis: 4.563 Cond. No. 19.6
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Removing the temp column causes a large drop in R-squared and Adj. R-squared (0.846 → 0.706), so the change is reverted and no further columns are dropped.
# Final Model
# add constant
X_train_new = sm.add_constant(X_train_new)
# Create fitting model
lm = sm.OLS(y_train, X_train_new).fit()
print(lm.summary())
OLS Regression Results
==============================================================================
Dep. Variable: cnt R-squared: 0.846
Model: OLS Adj. R-squared: 0.841
Method: Least Squares F-statistic: 168.8
Date: Sun, 27 Aug 2023 Prob (F-statistic): 3.54e-188
Time: 16:05:18 Log-Likelihood: 514.99
No. Observations: 510 AIC: -996.0
Df Residuals: 493 BIC: -924.0
Df Model: 16
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.1765 0.029 6.067 0.000 0.119 0.234
workingday 0.0515 0.011 4.705 0.000 0.030 0.073
temp 0.5199 0.025 21.132 0.000 0.472 0.568
hum -0.1592 0.038 -4.205 0.000 -0.234 -0.085
windspeed -0.1900 0.026 -7.395 0.000 -0.240 -0.140
winter 0.1413 0.012 11.378 0.000 0.117 0.166
yr_2019 0.2295 0.008 28.424 0.000 0.214 0.245
month_3 0.0652 0.015 4.368 0.000 0.036 0.095
month_4 0.1132 0.016 6.896 0.000 0.081 0.145
month_5 0.1280 0.017 7.638 0.000 0.095 0.161
month_6 0.0786 0.018 4.275 0.000 0.042 0.115
month_8 0.0694 0.017 3.989 0.000 0.035 0.104
month_9 0.1333 0.017 7.882 0.000 0.100 0.167
month_10 0.0444 0.017 2.559 0.011 0.010 0.078
sat 0.0621 0.014 4.403 0.000 0.034 0.090
w_mist -0.0575 0.010 -5.489 0.000 -0.078 -0.037
w_light -0.2475 0.026 -9.394 0.000 -0.299 -0.196
==============================================================================
Omnibus: 70.585 Durbin-Watson: 1.981
Prob(Omnibus): 0.000 Jarque-Bera (JB): 193.116
Skew: -0.679 Prob(JB): 1.16e-42
Kurtosis: 5.692 Cond. No. 20.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
1.Final $ R^2 $ is 0.846 (decent), Adj. $ R^2 $ is 0.841 (decent), and Prob (F-statistic) is far below 0.05, so the overall model fit is significant.
2.The Top 3 Features: temp, yr_2019, windspeed
(writing the equation with only the top 3 terms shown — the remaining coefficients must be included as well)
cnt = 0.1765 + (0.5199 * temp) + (0.2295 * yr_2019) - (0.1900 * windspeed) + ...
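Rather than hand-copying individual coefficients, the full equation can be assembled programmatically. This sketch uses the coefficients reported in the summary above; in the notebook itself they would come straight from `lm.params`.

```python
# Coefficients from the final model summary (const first)
coefs = {
    'const': 0.1765, 'workingday': 0.0515, 'temp': 0.5199, 'hum': -0.1592,
    'windspeed': -0.1900, 'winter': 0.1413, 'yr_2019': 0.2295,
    'month_3': 0.0652, 'month_4': 0.1132, 'month_5': 0.1280,
    'month_6': 0.0786, 'month_8': 0.0694, 'month_9': 0.1333,
    'month_10': 0.0444, 'sat': 0.0621, 'w_mist': -0.0575, 'w_light': -0.2475,
}

# Build the equation term by term, with an explicit sign for each coefficient
terms = [f"{coefs['const']:.4f}"]
for name, beta in coefs.items():
    if name == 'const':
        continue
    sign = '+' if beta >= 0 else '-'
    terms.append(f"{sign} {abs(beta):.4f} * {name}")
print('cnt =', ' '.join(terms))
```

In the notebook, replacing the hard-coded dictionary with `lm.params.to_dict()` would keep the printed equation in sync with the fitted model.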
# Predict the value (y_train_cnt)
y_train_cnt = lm.predict(X_train_new)
# plot histogram for error term
fig = plt.figure()
sns.histplot((y_train - y_train_cnt), bins=20, kde=True)  # distplot is deprecated in newer seaborn
fig.suptitle("Error Term", fontsize = 20)
plt.xlabel("error",fontsize = 10)
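A quick numeric check accompanies the histogram: OLS residuals from a model with an intercept average to zero by construction, so the error-term plot should be centred at 0. This sketch uses synthetic data; in the notebook the residuals are simply `y_train - y_train_cnt`.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 1, 200)
y = 0.2 + 0.5 * x + rng.normal(0, 0.1, 200)

# OLS via least squares, with an explicit intercept column
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta

print(round(residuals.mean(), 10))  # essentially 0, as expected for OLS
```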
col_list = X_train_new.columns
col_list
Index(['const', 'workingday', 'temp', 'hum', 'windspeed', 'winter', 'yr_2019',
'month_3', 'month_4', 'month_5', 'month_6', 'month_8', 'month_9',
'month_10', 'sat', 'w_mist', 'w_light'],
dtype='object')
col_list = ['workingday', 'temp', 'hum', 'windspeed', 'winter', 'yr_2019',
'month_3', 'month_4', 'month_5', 'month_6', 'month_8', 'month_9',
'month_10', 'sat', 'w_mist', 'w_light']
# apply the scaler fitted on the training data to the test set's numeric variables
df_test[num_vars] = scaler.transform(df_test[num_vars])
y_test = df_test.pop('cnt')
# maintain same columns as in final model dataset
X_test = df_test[col_list]
#add constant
X_test = sm.add_constant(X_test)
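A note on the `scaler.transform` call above: the test set is scaled with the min/max learned from the training data (no re-fitting), so both sets share a single scale. A minimal NumPy illustration of that fit/transform split, with made-up numbers:

```python
import numpy as np

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[25.0], [35.0]])

# "fit": learn the min and range from the training data only
lo, hi = train.min(axis=0), train.max(axis=0)

# "transform": apply those same parameters to both sets
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)

print(test_scaled.ravel())  # [0.75 1.25] -- test values may fall outside [0, 1]
```

Calling `fit_transform` on the test set instead would leak test-set statistics into the pipeline and silently put the two sets on different scales.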
#Making Prediction
y_pred = lm.predict(X_test)
# ploting y_test and y_pred (predicted values) to understand the spread
fig = plt.figure()
plt.scatter(y_test,y_pred)
#plot title
fig.suptitle("y_test and y_pred plot", fontsize = 20)
# X label
plt.xlabel('y_test', fontsize = 10)
#Y label
plt.ylabel('y_pred', fontsize = 10)
plt.show()
The points cluster along a straight diagonal, so the predictions track the actual test values well.
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
0.8186483411107541
The test-set $ R^2 $ of 0.819 is close to the training $ R^2 $ of 0.846, so the model generalises well and is a good fit.
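The test-set R² can also be adjusted for the number of predictors. A short sketch, using the 16 predictors of the final model and the 220-row test split (730 − 510):

```python
n, p = 220, 16            # test rows and predictors in the final model
r2 = 0.8186               # test-set r2_score from above
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))   # ~0.8043
```

The penalty is small here because n is much larger than p, which supports the conclusion that the model is not overfitting.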